According to the Center for Disease Control and Prevention (CDC), cardiovascular or heart diseases account for 1 in every 4 deaths in the United States annually. The estimated costs to the nation’s economy amount to about $363 billion annually, which include hospitalizations, after services health care needs as well as lost productivity due to death and disability (SS Virani, 2021). The CDC also states that given the lifestyle habits of an average American, about half of the population (47%) are reported to have “1 of the 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking”. However, the risk factors for developing heart diseases are vast and diverse and tend to vary between most racial and ethnic groups in the United States.
Therefore, there is an immediate need to detect the major risk factors early on, so that a wellness-oriented healthcare system can be instituted. Ideally, such a system would assess the intensity of the risk factors (compared to a baseline) and focus efforts on preventing the occurrence of developing heart diseases in the first place rather than minimizing second occurrences (Liau et al, 2010; Giampaoli et al, 2005 ).
The beginnings of the current understanding of the interconnected role of risk factors was first discussed in the Framingham Heart Study. Started in 1948, the study’s original goal was “ to identify common factors or characteristics that contribute to cardiovascular disease” (Hong 2018). However, today the FHS has morphed into a multigenerational study gathering genetic information that help analyze family patterns of certain diseases, including cardiovascular ailments (National Heart, Lung and Blood Institute). Jumping off from this, current research in the field has focused on using precision medicine algorithms along with computational development technology to assess a patient’s likelihood of developing heart disease given certain lifestyle factors. Most of the studies have focused largely on Caucasian individuals as a sample and indicating that there is a gap in the usefulness of current risk assessment tools working for individuals of different races.
In this paper, we aim to understand the different risk factors that affect the occurrence of heart diseases among the US population, so that we may be able to see some of the intermingling between different risk factors.
The original data set can be retrieved from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) annual health survey. The survey interviews over four hundred thousand individuals in the USA asking questions regarding the risk factors of heart disease. Conducted in 2020 (published on February 15, 2021), the original data set has been cleaned from its original 300 variables to 18 variables by Kamil Pytlak and can be accessed on Kaggle .
# Load the Data
heart <-read.csv("heart_2020_cleaned.csv")
# Descriptive Statistics
status (heart)
## variable q_zeros p_zeros q_na p_na q_inf p_inf
## HeartDisease HeartDisease 0 0.000 0 0 0 0
## BMI BMI 0 0.000 0 0 0 0
## Smoking Smoking 0 0.000 0 0 0 0
## AlcoholDrinking AlcoholDrinking 0 0.000 0 0 0 0
## Stroke Stroke 0 0.000 0 0 0 0
## PhysicalHealth PhysicalHealth 226589 0.709 0 0 0 0
## MentalHealth MentalHealth 205401 0.642 0 0 0 0
## DiffWalking DiffWalking 0 0.000 0 0 0 0
## Sex Sex 0 0.000 0 0 0 0
## AgeCategory AgeCategory 0 0.000 0 0 0 0
## Race Race 0 0.000 0 0 0 0
## Diabetic Diabetic 0 0.000 0 0 0 0
## PhysicalActivity PhysicalActivity 0 0.000 0 0 0 0
## GenHealth GenHealth 0 0.000 0 0 0 0
## SleepTime SleepTime 0 0.000 0 0 0 0
## Asthma Asthma 0 0.000 0 0 0 0
## KidneyDisease KidneyDisease 0 0.000 0 0 0 0
## SkinCancer SkinCancer 0 0.000 0 0 0 0
## type unique
## HeartDisease character 2
## BMI numeric 3604
## Smoking character 2
## AlcoholDrinking character 2
## Stroke character 2
## PhysicalHealth integer 31
## MentalHealth integer 31
## DiffWalking character 2
## Sex character 2
## AgeCategory character 13
## Race character 6
## Diabetic character 4
## PhysicalActivity character 2
## GenHealth character 5
## SleepTime integer 24
## Asthma character 2
## KidneyDisease character 2
## SkinCancer character 2
From the table we can see that physical and mental health have a higher percentage of zeros. This means that just over 70% of the people in our data set report being physically sick in a 30 day period while about 60% of the respondents report feeling mentally unwell in that same time period.
We can also see that there are no N/A values in our data set. This is not surprising given that our data was cleaned before being retrieved.
Below are the plots showing the means of our numerical variables. We can see that, on average, people with heart disease have a slightly higher BMI than those who don’t have heart disease. However, the biggest difference is the number of days that a respondent feels physically unwell. Respondents who have heart disease reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. Poor mental health in a 30 day period is slightly higher for people with heart disease, than for people without.
# Finding the mean of numerical variables grouped by heart disease
plot_num(heart)
heart %>%
group_by(HeartDisease) %>%
summarise_at(vars("BMI",
"PhysicalHealth",
"MentalHealth",
"SleepTime"), mean)
## # A tibble: 2 × 5
## HeartDisease BMI PhysicalHealth MentalHealth SleepTime
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 No 28.2 2.96 3.83 7.09
## 2 Yes 29.4 7.81 4.64 7.14
# Changing Diabetes bordeline and pregnancy categories to Yes/No
heart$Diabetic<- replace (heart$Diabetic, heart$Diabetic== "No, borderline diabetes" , "Yes")
heart$Diabetic[heart$Diabetic== "Yes (during pregnancy)"] <- "No"
# Changing all the categorical variables to numerical dummies
heart$HeartDisease<-ifelse(heart$HeartDisease=="Yes",1,0)
heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)
According to Noel Bairey Merz, MD, director of the Barbra Streisand Women’s Heart Center in the Smidt Heart Institute, women experience heart disease more severely than men. Meaning that even though men are more likely to suffer heart attacks, women are more likely to die from a heart attack than a man.
We started off by trying to see which gender tends to have more heart disease based on symptoms reported just before diagnosis. From the table below, we can see that we have a lot more data for people without heart disease than with, this will be addressed later. However, even in the percentage table for those who do have heart disease, we see that there are more men with heart disease (about 5%) compared to women (3%). This is in line with our basic theory that men have more heart disease. However, we are then forced to ask why do men have more heart disease in our sample? Are they doing more or less of a certain activity than their female counterparts? Discussions of the relevant variables are provided below.
contable = table(heart$HeartDisease, heart$Sex)
xkabledply(contable, title="Contingency Table for Heart Disease and Sex")
| Female | Male | |
|---|---|---|
| 0 | 156571 | 135851 |
| 1 | 11234 | 16139 |
addmargins(contable)
##
## Female Male Sum
## 0 156571 135851 292422
## 1 11234 16139 27373
## Sum 167805 151990 319795
prop <-prop.table(contable)
percent <- round(prop*100, 2)
#print ("Contingency Table for Heart Disease and Sex in Percentage")
print(percent, main="Contingency Table for Heart Disease and Sex in Percentage")
##
## Female Male
## 0 48.96 42.48
## 1 3.51 5.05
barplot(with(heart, table(HeartDisease, Sex)), main="Heart Disease Among Male & Female", beside=T, col=3:2)
In our exploratory analysis, we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease.
hist(heart$BMI,main="Histogram of BMI in Data Set", col='pink', xlab="BMI Levels")
In the histogram, we can see that our data is approximately normal and given the Central Limit Theorem, if n>30, then we can assume normality for the sake of statistical tests.
After sub-setting the data between male and female, we conducted a Welch’s t-test on the BMIs of both population subsets.
# Subset Male & Female with and without Heart Disease
female <- subset(heart, Sex== "Female")
male <- subset (heart, Sex == "Male")
# T-test of BMI between Genders
t.test(female$BMI, male$BMI)
##
## Welch Two Sample t-test
##
## data: female$BMI and male$BMI
## t = -15, df = 3e+05, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.387 -0.299
## sample estimates:
## mean of x mean of y
## 28.2 28.5
The p-value is less than the 5% significance level, allowing us to reject the null hypothesis and conclude that the overall average BMI is different between the male and female populations. From the mean values, we see that men have a mean BMI of 28.50 while women have a BMI of 28.16. A difference of 0.34.
Next, we divided our data frame between men and women who do have heart disease. Once again assuming normality due to the large sample size, we conducted another Welch’s T-test to look at the mean BMI difference among the population with heart issues.
# Subsetting genders based on respondents that have heart disease
female_HD<- subset(female, HeartDisease==1)
male_HD<- subset(male, HeartDisease== 1)
#T-test
t.test(female_HD$BMI, male_HD$BMI)
##
## Welch Two Sample t-test
##
## data: female_HD$BMI and male_HD$BMI
## t = -0.5, df = 20988, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.208 0.121
## sample estimates:
## mean of x mean of y
## 29.4 29.4
From the T-test output above, we see that the p-value is 0.60, which is greater than 0.05. This means that we fail to reject the null and conclude that there is no significant difference in the average BMI between men and women who do have heart disease.
Earlier, we saw that respondents, who have heart disease, reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. The box plot below shows us that women with heart disease generally report felling unwell more often than men.
ggplot(heart, aes(x = factor(HeartDisease), y = PhysicalHealth))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many times in the last 30 days have you felt physically ill?') +
xlab('Heart Disease') +
ylab('Physical Health')+
scale_fill_brewer(palette = 'Set2', name = 'Gender')
t.test(male_HD$PhysicalHealth, female_HD$PhysicalHealth)
##
## Welch Two Sample t-test
##
## data: male_HD$PhysicalHealth and female_HD$PhysicalHealth
## t = -16, df = 23059, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.53 -1.97
## sample estimates:
## mean of x mean of y
## 6.88 9.14
The t-test also confirms what the box plot is showing. Since the p-value is less than 5%, that there is a difference in the number of days that men and women feel unwell. On average, women reported feeling sick about 9 days while men reported just over 6 days.
ggplot(heart, aes(x = factor(HeartDisease), y = MentalHealth))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many times in the last 30 days have you felt mentally ill?') +
xlab('Heart Disease') +
ylab('Mental Health')+
scale_fill_brewer(palette = 'Set3', name = 'Sex')
t.test (male_HD$MentalHealth,female_HD$MentalHealth)
##
## Welch Two Sample t-test
##
## data: male_HD$MentalHealth and female_HD$MentalHealth
## t = -22, df = 21050, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.79 -2.34
## sample estimates:
## mean of x mean of y
## 3.59 6.16
The boxplot shows that women tend to report feeling mentally unwell more often than men.Since the p-value is less than 5%, indicating that there is a difference in the number of days that men and women feel mentally unwell. On average, women reported feeling mentally unwell about twice as many days as men. The cyclical nature of mental health and exercise has been covered extensively (Mikkelsen 2017; Raglin 1990); which assert that individuals struggling with mental health issues often are unable to complete physically demanding tasks, further making matters worse. This is because exercise has been shown to produce beneficial hormones that are associated with improvements in mental health.
ggplot(heart, aes(x = factor(HeartDisease), y = SleepTime))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle(' On average,how many hours of sleep do you get in a 24-hour period?') +
xlab('Heart Disease') +
ylab('Sleep Time')+
scale_fill_brewer(palette = 'Set', name = 'Gender')
t.test(male_HD$SleepTime,female_HD$SleepTime)
##
## Welch Two Sample t-test
##
## data: male_HD$SleepTime and female_HD$SleepTime
## t = 4, df = 22378, p-value = 2e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0515 0.1389
## sample estimates:
## mean of x mean of y
## 7.18 7.08
Visually it look like men and women get the same amount of sleep. However, the t-test shows that men and women have statistically different sleep times. Men on average get about 35 more minutes of sleep a week than women. This may not sound like much but sleep effects are known to be cumulative. A study conducted by Robbins et al. in 2021 “examined a group of people who got the recommended seven to eight hours of sleep a night to those who got less (even just 30 minutes less), and exposed them to a norovirus”. The study found that curtailing sleep below the recommended seven hours increases the risk for viral infection “nearly four times higher among those who are sleeping poorly or getting insufficient sleep, compared to those who are getting good rest” (Robbins et al, 2021).
This lack of sleep could explain why we see that women in our data on average report feeling sick more often and could also be a contributing factors is why women are more likely to die from heart attacks than men (Noel Bairey Merz).
#Smoking
ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle(' Have you smoked at least 100 cigarettes in your entire life?') +
xlab('Heart Disease') +
ylab('Smoking')+
scale_fill_brewer(palette = 'Set2', name = 'Gender')
t.test ( male_HD$Smoking,female_HD$Smoking)
##
## Welch Two Sample t-test
##
## data: male_HD$Smoking and female_HD$Smoking
## t = 17, df = 23643, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0941 0.1178
## sample estimates:
## mean of x mean of y
## 0.629 0.523
The number of men and women smoking in our data set are in equal proportions. However, the t-test shows that men average smoke slightly more than women but this difference does not seem significant.
ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many drinks have you had this week?') +
xlab('Heart Disease') +
ylab('Alcohol Consumption')+
scale_fill_brewer(palette = 'Set4', name = 'Gender')
t.test ( male_HD$AlcoholDrinking,
female_HD$AlcoholDrinking)
##
## Welch Two Sample t-test
##
## data: male_HD$AlcoholDrinking and female_HD$AlcoholDrinking
## t = 3, df = 25196, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.00133 0.01083
## sample estimates:
## mean of x mean of y
## 0.0442 0.0381
Note than in the questionnaire, heavy drinkers are defined as “adult men having more than 14 drinks per week and adult women having more than 7 drinks per week”. Even though, the number of drinkers are in equal proportions for both male and female, the number of drinks per between the populations are different. The threshold for being labelled a heavy drinker for women is half of that for a man.Therefore, the mean difference shown in the t-test might look small but on a per drink basis it is significant. Thus, we can conclude that men drink more on a weekly basis than women.
HD <- subset(heart, HeartDisease==1)
freq(HD$AgeCategory)
## var frequency percentage cumulative_perc
## 1 80 or older 5449 19.91 19.9
## 2 70-74 4847 17.71 37.6
## 3 65-69 4101 14.98 52.6
## 4 75-79 4049 14.79 67.4
## 5 60-64 3327 12.15 79.5
## 6 55-59 2202 8.04 87.6
## 7 50-54 1383 5.05 92.6
## 8 45-49 744 2.72 95.3
## 9 40-44 486 1.78 97.1
## 10 35-39 296 1.08 98.2
## 11 30-34 226 0.83 99.0
## 12 25-29 133 0.49 99.5
## 13 18-24 130 0.47 100.0
freq(HD$Race)
## var frequency percentage cumulative_perc
## 1 White 22507 82.22 82.2
## 2 Black 1729 6.32 88.5
## 3 Hispanic 1443 5.27 93.8
## 4 Other 886 3.24 97.0
## 5 American Indian/Alaskan Native 542 1.98 99.0
## 6 Asian 266 0.97 100.0
freq(HD$GenHealth)
## var frequency percentage cumulative_perc
## 1 Good 9558 34.92 34.9
## 2 Fair 7084 25.88 60.8
## 3 Very good 5381 19.66 80.5
## 4 Poor 3850 14.06 94.5
## 5 Excellent 1500 5.48 100.0
When looking at the population make-up of respondents who do have heart disease, we see that a higher percentage of respondents are 60 and older, with 82% being of Caucasian descent. At the time of the interview they also said they felt their overall health is good. This may be a limiting factor for future research since, once again, the sample does not have many data points from non-Caucasian individuals. We also see that most of our respondents with heart disease are older above 60 years of age, this is in-line with theory.
Even though a logit regression seems more appropriate to estimate whether or not an individual gets heart disease, we have used a linear regression in our exploratory stages to find the best fitting model.
reg_1 <- lm(HeartDisease~Sex+BMI, data=heart)
print('Model One BIC')
## [1] "Model One BIC"
BIC(reg_1)
## [1] 90503
summary (reg_1)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2521 -0.1024 -0.0805 -0.0587 0.9677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.03e-03 2.29e-03 2.2 0.028 *
## SexMale 3.85e-02 9.87e-04 39.0 <2e-16 ***
## BMI 2.20e-03 7.76e-05 28.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 319792 degrees of freedom
## Multiple R-squared: 0.0074, Adjusted R-squared: 0.00739
## F-statistic: 1.19e+03 on 2 and 319792 DF, p-value: <2e-16
reg_2 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + SleepTime, data=heart)
print('Model Two BIC')
## [1] "Model Two BIC"
BIC(reg_2)
## [1] 80963
summary (reg_2)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth +
## SleepTime, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3585 -0.0918 -0.0726 -0.0425 1.0001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.18e-02 3.41e-03 -6.40 1.5e-10 ***
## SexMale 4.22e-02 9.78e-04 43.10 < 2e-16 ***
## BMI 1.43e-03 7.70e-05 18.55 < 2e-16 ***
## PhysicalHealth 6.18e-03 6.41e-05 96.37 < 2e-16 ***
## MentalHealth -4.95e-04 6.44e-05 -7.69 1.5e-14 ***
## SleepTime 3.95e-03 3.41e-04 11.58 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.275 on 319789 degrees of freedom
## Multiple R-squared: 0.0367, Adjusted R-squared: 0.0367
## F-statistic: 2.44e+03 on 5 and 319789 DF, p-value: <2e-16
reg_3 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + Smoking + AlcoholDrinking+ SleepTime + AgeCategory + Race, data=heart)
print('Model Three BIC')
## [1] "Model Three BIC"
BIC(reg_3)
## [1] 60289
summary (reg_3)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth +
## Smoking + AlcoholDrinking + SleepTime + AgeCategory + Race,
## data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4570 -0.1220 -0.0529 -0.0020 1.0648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.16e-02 5.25e-03 -11.74 < 2e-16 ***
## SexMale 4.88e-02 9.57e-04 50.95 < 2e-16 ***
## BMI 2.05e-03 7.62e-05 26.92 < 2e-16 ***
## PhysicalHealth 4.61e-03 6.32e-05 72.95 < 2e-16 ***
## MentalHealth 9.31e-04 6.39e-05 14.56 < 2e-16 ***
## Smoking 3.50e-02 9.92e-04 35.27 < 2e-16 ***
## AlcoholDrinking -2.25e-02 1.89e-03 -11.93 < 2e-16 ***
## SleepTime -1.39e-03 3.33e-04 -4.16 3.1e-05 ***
## AgeCategory25-29 -6.29e-03 2.75e-03 -2.29 0.0222 *
## AgeCategory30-34 -6.45e-03 2.69e-03 -2.40 0.0166 *
## AgeCategory35-39 -5.83e-03 2.64e-03 -2.21 0.0269 *
## AgeCategory40-44 1.14e-03 2.63e-03 0.43 0.6643
## AgeCategory45-49 1.10e-02 2.61e-03 4.20 2.6e-05 ***
## AgeCategory50-54 2.92e-02 2.52e-03 11.57 < 2e-16 ***
## AgeCategory55-59 4.63e-02 2.44e-03 18.96 < 2e-16 ***
## AgeCategory60-64 6.98e-02 2.39e-03 29.17 < 2e-16 ***
## AgeCategory65-69 9.48e-02 2.39e-03 39.60 < 2e-16 ***
## AgeCategory70-74 1.32e-01 2.44e-03 54.00 < 2e-16 ***
## AgeCategory75-79 1.65e-01 2.65e-03 62.18 < 2e-16 ***
## AgeCategory80 or older 2.08e-01 2.58e-03 80.59 < 2e-16 ***
## RaceAsian -2.45e-02 4.75e-03 -5.15 2.5e-07 ***
## RaceBlack -2.01e-02 4.09e-03 -4.91 9.1e-07 ***
## RaceHispanic -1.75e-02 4.03e-03 -4.35 1.4e-05 ***
## RaceOther -1.30e-02 4.48e-03 -2.90 0.0037 **
## RaceWhite -2.04e-02 3.73e-03 -5.48 4.3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.266 on 319770 degrees of freedom
## Multiple R-squared: 0.0977, Adjusted R-squared: 0.0976
## F-statistic: 1.44e+03 on 24 and 319770 DF, p-value: <2e-16
With each new addition of a variable, we see that the adjusted R-squared is increasing indicating that the model fit is improving. However, the fit statistic is still quite low. Note that the 40-44 years age category has insignificant p-values. Since model 3 has the highest adjusted r-squared and the lowest BIC values, it will be our preferred model.
ezids::xkablevif(reg_3, wide=FALSE)
| V1 | |
|---|---|
| AgeCategory25-29 | 1.03 |
| AgeCategory30-34 | 1.06 |
| AgeCategory35-39 | 1.14 |
| AgeCategory40-44 | 1.17 |
| AgeCategory45-49 | 1.08 |
| AgeCategory50-54 | 1.02 |
| AgeCategory55-59 | 1.04 |
| AgeCategory60-64 | 1.72 |
| AgeCategory65-69 | 1.81 |
| AgeCategory70-74 | 1.89 |
| AgeCategory75-79 | 1.92 |
| AgeCategory80 or older | 1.95 |
| AlcoholDrinking | 2.10 |
| BMI | 2.28 |
| MentalHealth | 2.45 |
| PhysicalHealth | 2.47 |
| RaceAsian | 2.37 |
| RaceBlack | 1.99 |
| RaceHispanic | 2.10 |
| RaceOther | 2.51 |
| RaceWhite | 5.04 |
| SexMale | 5.77 |
| SleepTime | 3.00 |
| Smoking | 11.28 |
From, the VIF table, the VIF value for smoking is 11.28, this is undesirable and thus will be dropped. Therefore our preferred model is as follows:
HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory
We will train and test the model to see the cost analysis of our predictions.
In order to address the unbalanced classification, we subsetted our original data frame to create a new one (called Total) that had equal proportions of the Heart Disease category. The new dataset, called Total, has over 54000 observations that have been split into an 80-20 train and test sets.
# Creating a new dataframe (called total) with equal proportions of Heart Disease values
newdata1<- subset(heart, HeartDisease==1)
newdata1_1<-head(newdata1,27373)
nrow(newdata1_1)
## [1] 27373
newdata0<- subset(heart, HeartDisease==0)
newdata0_1<-head(newdata0,27373)
nrow(newdata0_1)
## [1] 27373
total <- rbind(newdata1_1, newdata0_1)
#nrow(total)
#summary(total)
#freq(total$HeartDisease)
# Convert to factor variables
total$HeartDisease<-factor(total$HeartDisease)
total$Sex<-factor(total$Sex)
# Split into 80-20 train-test sets
set.seed(1234)
sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test <- total[-sample, ]
# Logit Model Training
model<- glm(HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory, data= train, family='binomial')
#summary(model)
#Chi-squared test for model fit
anova(model, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: HeartDisease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 43795 60714
## Sex 1 704 43794 60010 < 2e-16 ***
## BMI 1 510 43793 59499 < 2e-16 ***
## PhysicalHealth 1 2416 43792 57083 < 2e-16 ***
## MentalHealth 1 28 43791 57055 9.4e-08 ***
## AlcoholDrinking 1 124 43790 56931 < 2e-16 ***
## SleepTime 1 10 43789 56921 0.0016 **
## AgeCategory 12 9273 43777 47647 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the anova chi-squared results, we can conclude that the logit model (HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory) used is a good fit model for predicting the probability of getting heart disease. All the coefficient p-values in the model are also significant, except for the 25-39 age range, which is significant at the 10% level.
# Exponential Coefficients & Confidence Interval
exp_coeff<- exp(cbind(OR = coef(model), confint(model)))
knitr::kable(exp_coeff,caption='Odds Ratio & Confidence Interval of Model')
| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | 0.018 | 0.014 | 0.023 |
| SexMale | 2.106 | 2.014 | 2.203 |
| BMI | 1.041 | 1.037 | 1.044 |
| PhysicalHealth | 1.042 | 1.039 | 1.044 |
| MentalHealth | 1.021 | 1.018 | 1.024 |
| AlcoholDrinking | 0.797 | 0.722 | 0.879 |
| SleepTime | 0.947 | 0.933 | 0.960 |
| AgeCategory25-29 | 0.916 | 0.685 | 1.224 |
| AgeCategory30-34 | 1.602 | 1.246 | 2.068 |
| AgeCategory35-39 | 1.865 | 1.465 | 2.387 |
| AgeCategory40-44 | 3.091 | 2.463 | 3.908 |
| AgeCategory45-49 | 4.059 | 3.265 | 5.089 |
| AgeCategory50-54 | 7.385 | 5.990 | 9.188 |
| AgeCategory55-59 | 10.309 | 8.398 | 12.778 |
| AgeCategory60-64 | 14.405 | 11.766 | 17.810 |
| AgeCategory65-69 | 18.361 | 15.012 | 22.681 |
| AgeCategory70-74 | 27.403 | 22.402 | 33.855 |
| AgeCategory75-79 | 33.746 | 27.513 | 41.797 |
| AgeCategory80 or older | 51.239 | 41.794 | 63.438 |
According to the model, the odds for getting heart disease increases by 2.11 for a man compared to a woman. While an increase in BMI increases the odds for heart disease by 1.03. Mental health,physical health, drinking, and sleep time also affect your chances of getting heart disease but are not the most salient risk factors; they could be used as early indicators. As expected, the odds of getting heart disease for age increases as an individual gets older.
Also note, the confidence intervals for all the variables are narrow, indicating that they are good predictors of heart disease. However, the confidence intervals for age gets wider as we increase an individuals age. Although the uncertainty is greater at higher age ranges, there still may be enough precision to inform a patient about preventative measures, when considering their lifestyle choices.
library(ggthemes)
# Prediction
train$prediction <- predict( model, newdata = train, type = "response" )
test$prediction <- predict( model, newdata = test , type = "response" )
# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) +
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) +
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) +
theme_economist()
# Positive instance = 1 = Yes Heart Disease
# Negative instance = 0 = No Heart Disease
After obtaining the predicted values that an individual will have heart disease on both the training and testing sets, a double density plot was drawn to show those predictions. Since we split our data evenly, the negative instances are skewed to the left while the positive instances are skewed to the right. This is an ideal density plot.
# user-defined different cost for false negative and false positive
source('unbalanced_functions.R')
cost.fp = 100
cost.fn = 200
roc_info <- ROCInfo( data = test,
predict = "prediction",
actual = "HeartDisease",
cost.fp = 100,
cost.fn = 200 )
grid.draw(roc_info$plot)
From the plots, we can see that our Area Under the Curve is about 79%, which is an acceptable model for our purposes. Specifying our different costs for false positives ($100) and false negatives ($200), we see that our total loss cost is just under $382000.
loadPkg("regclass")
xkabledply( confusion_matrix(model), title = "Confusion Matrix from Logit Model" )
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 14653 | 7276 | 21929 |
| Actual 1 | 4691 | 17176 | 21867 |
| Total | 19344 | 24452 | 43796 |
#unloadPkg("regclass")
#
cm_info <- ConfusionMatrixInfo( data = test,
predict = "prediction",
actual = 'HeartDisease',
cutoff = .30 )
#ggthemr("flat")
plot(cm_info$plot)
Calculated Values for Model:
Accuracy = 0.727 Mis-classification = 0.27 Sensitivity = 0.78 Specificity = 0.66
Using the optimal cut-off value of 0.30 obtained from the ROC-AUC Loss function, we see that our model a is able to calculate more positive values than negative values. The accuracy of the model is about 73%, which is not a good model in the healthcare field but can still be used as a public early detection tool to bring about individual awareness. We see that our model shows high sensitivity and low specificity, this means that our model is great at estimating actual cases of whether an individual will have heart disease but tends to have a high false positive rate. In the grand scheme life, higher false poistives is not as bad as having a false negative rate. For instance, an individual A uses our model and is told that they have a higher likelihood of having heart issues in the future. However, turns out this is a false positive and A just ends up living a bit more healthier than they did before. However, it is the false negatives that we will need to adjust for in the future.